Expected Value, Variance, and Why Your Loss Function Is a Statistic

You minimise loss functions every day. MSE, cross-entropy, MAE: they’re the core of model training. But each one is an expected value in disguise. Understanding that connection changes how you think about every model you build.
Author: Godwill

Published: February 16, 2026

Note: What you’ll learn in 10 minutes
  • Expected value is the bridge between probability distributions and the numbers you compute from data, and you’re already using it every time you evaluate a loss function.
  • Variance measures spread around the mean, and it quietly controls everything from feature scaling to overfitting to how noisy your test accuracy is.
  • Every loss function is an expected value in disguise: MSE estimates \(E[(Y-\hat{Y})^2]\) under Gaussian errors, cross-entropy estimates \(-E[Y\log\hat{P} + \ldots]\) under Bernoulli targets, MAE targets the median under Laplace errors.
  • When you pick a loss, you pick a distribution. When you train, you minimise an expected value. When you evaluate, you estimate one from a finite sample.

You already use expected values. You just don’t call them that.

Every time you compute an average, you’re computing an expected value. Every time you evaluate a loss function, you’re computing an expected value. Every time you report “my model’s accuracy is 91%,” that number is an expected value.

The concept hides behind so many familiar operations that most ML engineers never pause to examine it directly. That’s a problem, because expected value isn’t just a formula; it’s the bridge between probability distributions (which describe the world) and the numbers you compute from data (which summarise it).

In the previous article, we established that your data consists of realisations of random variables, and that your model learns a conditional distribution. Now the question is: once you have a distribution, what do you do with it? How do you extract a single useful number from an entire probability distribution?

The answer is expected value. And the specific expected values you’re already computing, without knowing it, are your loss functions.

Expected value: the long-run average

Let \(X\) be a random variable with some distribution. The expected value of \(X\), written \(E[X]\), is the average value you’d get if you could draw from that distribution infinitely many times.

For a discrete random variable with values \(x_1, x_2, \ldots\) and probabilities \(p_1, p_2, \ldots\):

\[E[X] = \sum_i x_i \, p_i\]

For a continuous random variable with density \(f(x)\):

\[E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx\]

Both say the same thing: weight each possible value by how likely it is, and add them up.

A concrete example

Roll a fair die. The random variable \(X\) is the number that comes up. The expected value is:

\[E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5\]

You’ll never roll 3.5. That’s fine. The expected value isn’t a value you expect to see; it’s the centre of gravity of the distribution. It’s the number your sample mean converges to as you roll more and more times.
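You can check that convergence numerically. A minimal NumPy sketch (the seed and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Exact expected value: each face 1..6 weighted by its probability 1/6
faces = np.arange(1, 7)
exact = (faces * (1 / 6)).sum()  # 3.5

# The sample mean drifts toward E[X] as the number of rolls grows
for n in (10, 1_000, 100_000):
    rolls = rng.integers(1, 7, size=n)  # upper bound is exclusive
    print(f"{n:>7} rolls: mean = {rolls.mean():.3f}")
```

Run it a few times with different seeds: the small-sample means bounce around, while the large-sample mean hugs 3.5.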

Expected value of a function

Here’s where things get powerful. You can take the expected value of any function of a random variable, not just \(X\) itself.

If \(g(X)\) is some function of \(X\), then:

\[E[g(X)] = \sum_i g(x_i) \, p_i \quad \text{(discrete)}\]

\[E[g(X)] = \int g(x) \, f(x) \, dx \quad \text{(continuous)}\]

This is called the Law of the Unconscious Statistician (yes, really: the name is a gentle jab at people who use it without realising what they’re doing).

Why does this matter? Because loss functions are functions of random variables. And when you average a loss function over your data, you’re estimating the expected value of that function. Let’s make this precise.
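The die makes LOTUS concrete: to get \(E[X^2]\), you don’t need the distribution of \(X^2\) itself, only the probabilities of \(X\). A quick sketch (the Monte Carlo check uses an arbitrary seed):

```python
import numpy as np

faces = np.arange(1, 7)
p = np.full(6, 1 / 6)

# LOTUS: weight g(x) = x^2 by the probabilities of X itself
e_x2 = (faces**2 * p).sum()  # 91/6, roughly 15.167

# Monte Carlo check: average g over simulated draws of X
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=200_000)
mc_estimate = (rolls**2).mean()
```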

Variance: how spread out is the distribution?

Before we get to loss functions, we need one more concept. Variance measures how spread out a random variable is around its expected value:

\[\text{Var}(X) = E\left[(X - E[X])^2\right]\]

Read it carefully. Variance is itself an expected value, the expected value of the squared deviation from the mean. It’s \(E[g(X)]\) where \(g(X) = (X - \mu)^2\).

The square root of variance is the standard deviation, \(\sigma = \sqrt{\text{Var}(X)}\), which has the same units as \(X\) and is easier to interpret.

A useful alternative formula (derived by expanding the square):

\[\text{Var}(X) = E[X^2] - (E[X])^2\]

This tells you: the variance is the expected value of the square minus the square of the expected value. Or more memorably: “the mean of the squares minus the square of the mean.”
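Both formulas give the same number, which you can verify on the fair die from earlier (a minimal sketch):

```python
import numpy as np

faces = np.arange(1, 7)
p = np.full(6, 1 / 6)

mu = (faces * p).sum()                    # E[X] = 3.5
var_def = ((faces - mu) ** 2 * p).sum()   # E[(X - E[X])^2]
var_alt = (faces**2 * p).sum() - mu**2    # E[X^2] - (E[X])^2
# Both equal 35/12, roughly 2.917
```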

Why variance matters for ML

Variance shows up everywhere:

In your data: The variance of your features determines how much information they carry. A feature with zero variance is useless: it tells you nothing. Feature scaling (standardisation) divides by standard deviation precisely to put all features on equal footing.

In your model: The variance of your model’s predictions across different training sets is what we call “model variance” in the bias-variance tradeoff. High variance means your model is sensitive to which specific data points it was trained on; that’s overfitting.

In your estimates: When you report “accuracy = 0.91”, that number has a variance. It would be different if you’d tested on a different sample. The variance tells you how much to trust it.

Covariance: when random variables move together

When you have two random variables, you often want to know: do they tend to increase together, or does one go up when the other goes down?

Covariance measures this:

\[\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]\]

Positive covariance means they move together. Negative means they move in opposite directions. Zero covariance means there’s no linear relationship (but there could still be a nonlinear one).

Correlation is covariance normalised to lie between -1 and 1:

\[\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}}\]

Here’s the ML connection that matters: multicollinearity happens when your feature variables have high covariance with each other. When features are highly correlated, your regression coefficients become unstable: they have high variance, are hard to interpret, and can flip sign. PCA, which we’ll cover in a future article, works by finding directions in feature space along which the variance is maximised while the covariance between the new directions is zero.
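To see both definitions in action, here’s a sketch on synthetic data (the linear relationship and the seed are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)  # linearly related to x, plus noise
z = rng.normal(size=10_000)          # independent of x

# Cov(X, Y) = E[XY] - E[X]E[Y], estimated from the sample
cov_xy = (x * y).mean() - x.mean() * y.mean()

# Correlation: covariance divided by both standard deviations
rho_xy = cov_xy / (x.std() * y.std())  # close to 2/sqrt(5), about 0.894
rho_xz = np.corrcoef(x, z)[0, 1]       # close to 0
```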

Now: the connection that changes everything

Here’s the payoff. Let’s look at what you’re actually computing when you train a model.

You have training data \(\{(x_i, y_i)\}_{i=1}^n\). You have a model \(\hat{y} = f(x; \theta)\) with parameters \(\theta\). You define a loss function \(L(y, \hat{y})\) that measures how bad each prediction is. Then you compute:

\[\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i; \theta))\]

This is the empirical risk: the average loss over your training data. You minimise it to find the best parameters.

But what are you really doing? You’re computing a sample average of the function \(L(y, f(x; \theta))\). And a sample average is an estimate of an expected value:

\[R(\theta) = E[L(Y, f(X; \theta))]\]

This is the true risk: the expected loss over the entire data-generating distribution. It’s the quantity you actually care about, because it tells you how well your model will perform on new data, not just the training set.

Training is estimating parameters that minimise an expected value. Your loss function is the function inside that expectation.

This is not a metaphor. It’s literally what’s happening when you call model.fit().
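Here’s the idea in miniature, with a hypothetical one-parameter model \(\hat{y} = \theta x\) and squared-error loss (the data-generating slope of 3 and the seed are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data-generating process: Y = 3X + Gaussian noise
x = rng.uniform(-1, 1, size=500)
y = 3 * x + rng.normal(scale=0.5, size=500)

def empirical_risk(theta):
    """Sample average of the loss: an estimate of E[(Y - theta*X)^2]."""
    return np.mean((y - theta * x) ** 2)

# Minimising the empirical risk recovers (approximately) the true slope
grid = np.linspace(0, 6, 601)
best_theta = grid[np.argmin([empirical_risk(t) for t in grid])]
```

model.fit() does the same thing with a smarter optimiser and more parameters: it minimises a sample average standing in for an expectation.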

Loss functions are expected values: the specifics

Let’s make this concrete for the loss functions you use every day.

Mean Squared Error

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \quad \approx \quad E\left[(Y - \hat{Y})^2\right]\]

The MSE is a sample estimate of the expected squared error. Notice that this looks exactly like the formula for variance, and that’s not a coincidence. If your model predicts the mean perfectly (\(\hat{Y} = E[Y|X]\)), then the remaining MSE is the irreducible variance of \(Y\) given \(X\). That’s the noise in your data that no model can eliminate.

Here’s the deeper insight: MSE is the natural loss function when you assume \(Y|X \sim \mathcal{N}(\hat{y}, \sigma^2)\). Minimising MSE is identical to maximising the likelihood under Gaussian errors. We’ll prove this formally in the next article on likelihood, but for now, know that choosing MSE means assuming your errors are normally distributed. If they’re not (if they’re skewed, heavy-tailed, or heteroscedastic), MSE may not be the right choice.

Mean Absolute Error

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \quad \approx \quad E\left[|Y - \hat{Y}|\right]\]

MAE estimates the expected absolute error. Minimising MAE doesn’t correspond to Gaussian errors: it corresponds to Laplace (double exponential) errors. This is why MAE is more robust to outliers: the Laplace distribution has heavier tails than the Gaussian, so it’s less surprised by extreme values.

The best prediction under MAE is the conditional median of \(Y|X\), not the mean. This is a key distinction: MSE targets the mean, MAE targets the median. If your data is skewed, these are different numbers and the choice matters.
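You can watch the mean/median split happen on skewed data. A sketch using a single constant prediction \(c\) (the lognormal target and grid search are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Lognormal target: heavily right-skewed, so its mean exceeds its median
y = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)

# For a constant prediction c, minimise each empirical risk by grid search
grid = np.linspace(0.1, 5.0, 2_000)
best_mse = grid[np.argmin([((y - c) ** 2).mean() for c in grid])]
best_mae = grid[np.argmin([np.abs(y - c).mean() for c in grid])]

# best_mse lands near the sample mean (about 1.65 here),
# best_mae near the sample median (about 1.0)
```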

Cross-Entropy (Log Loss)

For binary classification with true labels \(y_i \in \{0, 1\}\) and predicted probabilities \(\hat{p}_i\):

\[\text{CE} = -\frac{1}{n}\sum_{i=1}^n \left[y_i \log \hat{p}_i + (1-y_i)\log(1 - \hat{p}_i)\right] \quad \approx \quad -E\left[Y\log\hat{P} + (1-Y)\log(1-\hat{P})\right]\]

Cross-entropy is the expected negative log-probability assigned to the true outcome. It’s what you get when you do maximum likelihood estimation for a Bernoulli distribution, which is exactly what logistic regression does.

Minimising cross-entropy makes your predicted probabilities \(\hat{p}\) as close as possible (in a precise information-theoretic sense) to the true conditional probability \(P(Y=1|X)\).
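Computed from scratch, cross-entropy is just that average negative log-probability. A minimal sketch (the labels and probabilities are made up):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Average negative log-probability assigned to the true labels."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
confident_right = np.array([0.9, 0.1, 0.8, 0.95])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.05])

low = binary_cross_entropy(y, confident_right)   # about 0.12
high = binary_cross_entropy(y, confident_wrong)  # about 2.30
```

Confidently wrong predictions dominate the average: that’s the log at work.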

The unifying view

| What you compute | What it estimates | What it assumes |
|---|---|---|
| \(\frac{1}{n}\sum(y_i - \hat{y}_i)^2\) | \(E[(Y-\hat{Y})^2]\) | Gaussian errors |
| \(\frac{1}{n}\sum\|y_i - \hat{y}_i\|\) | \(E[\|Y-\hat{Y}\|]\) | Laplace errors |
| \(-\frac{1}{n}\sum y_i \log\hat{p}_i + \ldots\) | \(-E[Y\log\hat{P} + \ldots]\) | Bernoulli targets |
| \(\frac{1}{n}\sum(y_i\log\frac{y_i}{\hat{y}_i} - y_i + \hat{y}_i)\) | \(E\left[Y\log\frac{Y}{\hat{Y}} - Y + \hat{Y}\right]\) | Poisson targets |

Every loss function in this table is a sample average estimating an expected value. Every one corresponds to a distributional assumption about your target variable. Every time you pick a loss function, you’re making a statistical assumption whether you intend to or not.

Properties of expected value that make ML work

A few properties of \(E[\cdot]\) are quietly holding the entire ML pipeline together:

Linearity: \(E[aX + bY] = aE[X] + bE[Y]\)

This is why you can decompose complex losses into simpler components. It’s why regularised loss = data loss + penalty works mathematically. It’s why ensemble methods (averaging multiple models) reduce expected error.

Law of Large Numbers: As \(n \to \infty\), the sample mean \(\bar{X}_n \to E[X]\)

This is why training on more data works. Your empirical risk (sample average loss) converges to the true risk (expected loss) as your dataset grows. With enough data, the training loss becomes a reliable estimate of the test loss.

Iterated Expectation (Tower Law): \(E[E[Y|X]] = E[Y]\)

This says: if you first compute the expected value of \(Y\) within each group defined by \(X\), and then average those, you get the overall expected value of \(Y\). This is the mathematical foundation of the bias-variance decomposition, which we’ll derive in a future article.
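A quick numeric check of the tower law (the three-group setup and its values are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# X: a group label; Y's mean depends on which group X falls in
x = rng.integers(0, 3, size=n)
group_centres = np.array([1.0, 2.0, 5.0])
y = group_centres[x] + rng.normal(size=n)

# Inner expectation: mean of Y within each group defined by X
inner = np.array([y[x == g].mean() for g in range(3)])
# Outer expectation: average the group means, weighted by P(X = g)
weights = np.array([(x == g).mean() for g in range(3)])

towered = (inner * weights).sum()  # E[E[Y|X]]
overall = y.mean()                 # E[Y]; identical up to float error
```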

The sample mean is an estimator, and it has variance

When you compute \(\bar{x} = \frac{1}{n}\sum x_i\) from data, you’re estimating \(E[X]\) from a finite sample. This estimate is itself a random variable: if you’d drawn different data, you’d get a different \(\bar{x}\).

The variance of the sample mean is:

\[\text{Var}(\bar{X}) = \frac{\text{Var}(X)}{n}\]

This one formula explains several things at once:

Why more data helps: As \(n\) increases, \(\text{Var}(\bar{X})\) decreases. Your estimate becomes more precise. This is why training on more data is almost always beneficial: you’re reducing the variance of your loss estimates.

Why your test accuracy fluctuates: If you evaluate on 100 test samples vs. 10,000, the variance of your accuracy estimate differs by a factor of 100. Small test sets give noisy estimates.

Why mini-batch SGD works: Each mini-batch gives you a noisy estimate of the gradient (which is itself an expected value). The variance of that estimate decreases with batch size. Smaller batches = noisier gradients = more variance but faster updates. Larger batches = smoother gradients = less variance but slower updates. The tradeoff is directly governed by \(\text{Var}(\bar{X}) = \text{Var}(X)/n\).
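All three effects trace back to the same \(1/n\). You can simulate many sample means and watch their variance shrink (the population, seed, and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
pop_sd = 2.0  # population is N(0, 2^2), so Var(X) = 4

def variance_of_sample_mean(n, trials=5_000):
    """Draw `trials` samples of size n; return the variance of their means."""
    means = rng.normal(0.0, pop_sd, size=(trials, n)).mean(axis=1)
    return means.var()

for n in (10, 100, 1_000):
    # Empirical variance of the mean vs the theoretical Var(X) / n
    print(n, variance_of_sample_mean(n), pop_sd**2 / n)
```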

What this means in practice

Understanding that loss functions are expected values gives you three practical capabilities:

1. You can diagnose loss function mismatch. If your residuals are heavily skewed, MSE (which assumes Gaussian errors) is pulling your predictions toward a mean that isn’t representative. Switch to MAE (which targets the median) or a quantile loss. If your count data has overdispersion, Poisson deviance will underestimate uncertainty; consider negative binomial loss instead. The distributional assumption behind your loss function must match the actual behaviour of your data.

2. You can understand why your evaluation metrics are noisy. Your reported accuracy, F1 score, or AUC are all sample averages estimating expected values. They have variance. Reporting them without confidence intervals is like reporting a point estimate without uncertainty: technically a number, practically meaningless. We’ll cover confidence intervals properly in a future article.

3. You can reason about the bias-variance tradeoff. The expected test error decomposes into bias squared plus variance plus irreducible noise. All three terms are expected values. Understanding this decomposition requires understanding expected value first, which is why we’re building this foundation before tackling bias-variance directly.

The mental model to take away

Before: “I pick MSE for regression and cross-entropy for classification because that’s what the tutorial said.”

After: “MSE estimates \(E[(Y - \hat{Y})^2]\) under a Gaussian assumption. Cross-entropy estimates \(-E[Y\log\hat{P} + (1-Y)\log(1-\hat{P})]\) under a Bernoulli assumption. I choose based on the distributional properties of my target variable, and I check whether those assumptions hold.”

That’s the shift. Loss functions aren’t arbitrary choices or convention. They’re expected values derived from specific distributional assumptions. When you pick a loss, you’re picking a distribution. When you train a model, you’re minimising an expected value. When you evaluate, you’re estimating an expected value from a finite sample.

Next week, we’ll close the loop. You now know your data comes from distributions (article 2) and your training process minimises expected values of loss functions derived from those distributions (this article). The missing piece is: how exactly do you estimate the distribution’s parameters from data? That’s likelihood: the single most important concept in ML that nobody teaches properly.


This is article 3 of Stats Beneath, a weekly series on the statistical foundations of machine learning. Subscribe to get each article when it’s published.